Game Analytics: From Exploratory Data Analysis to Predictive Modeling
Author
Hoang Son Lai
Published
November 17, 2025
Introduction
The modern gaming landscape is fiercely competitive, where player retention and engagement are the ultimate currencies. Success is no longer solely determined by creative design and immersive gameplay but increasingly by the ability to understand and adapt to player behavior. This project, “Game Analytics: From Exploratory Data Analysis to Predictive Modeling,” demonstrates this data-driven paradigm by conducting a comprehensive analysis of Flappy Plane Adventure, a dynamic side-scrolling shooter.
Leveraging a rich dataset of 300 game sessions, this study moves beyond traditional descriptive statistics to uncover the deep-seated patterns that govern player success and failure. My journey begins with a thorough Exploratory Data Analysis (EDA), where I visualize performance distributions, identify the most common obstacles, and engineer advanced behavioral features such as aggressiveness, efficiency, and risk-taking to quantify playstyles.
I then tackle the challenge of a limited dataset through bootstrapping, artificially expanding my training data to build more robust and generalizable machine learning models. This foundation allows me to segment the player base into distinct behavioral profiles using unsupervised learning (K-Means Clustering), revealing clear archetypes from hesitant Beginners to seasoned Experts.
The core of this investigation lies in supervised predictive modeling. I develop and compare multiple algorithms to:
Predict final scores with near-perfect accuracy using a Random Forest regressor.
Forecast player survival beyond a critical 30-second threshold.
Anticipate the cause of a player’s death through a multiclass classification model.
Ultimately, this report transcends a mere technical exercise. Each model and visualization is meticulously interpreted to generate actionable, evidence-based recommendations for game balancing, targeted player engagement, and strategic monetization. My goal is to provide a clear blueprint for how data science can be practically applied to create a more enjoyable, balanced, and commercially successful gaming experience.
1. Data Overview & Processing
The data preparation stage begins by loading the raw game session CSV and converting timestamp strings into POSIX datetime objects for start_time and end_time. Missing or problematic values are handled (for example game_duration is set to 0 where missing), and several derived metrics are computed: score_per_second (score divided by duration) and accuracy (UFOs shot divided by bullets fired).
Code
# Load and clean the data
library(dplyr)  # provides %>%, mutate(), filter()

game_data <- read.csv("data/game_sessions.csv", stringsAsFactors = FALSE)

# Data cleaning and preprocessing
game_data_clean <- game_data %>%
  mutate(
    # The trailing "Z" in the timestamps marks UTC, so parse with tz = "UTC"
    start_time = as.POSIXct(start_time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC"),
    end_time = as.POSIXct(end_time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC"),
    death_reason = as.factor(death_reason),
    # Handle missing end_time
    game_duration = ifelse(is.na(game_duration), 0, game_duration),
    # Create performance metrics (guard against division by zero)
    score_per_second = ifelse(game_duration > 0, score / game_duration, 0),
    accuracy = ifelse(bullets_fired > 0, ufos_shot / bullets_fired, 0)
  ) %>%
  filter(!is.na(start_time))
library(tibble)
library(gt)

# Data dictionary for the cleaned session table
variable_description <- tibble(
  Variable = c(
    "id", "start_time", "end_time", "score", "coins_collected", "ufos_shot",
    "bullets_fired", "death_reason", "game_duration", "pipes_passed",
    "score_per_second", "accuracy"
  ),
  Description = c(
    "Unique session identifier",
    "Timestamp when the game session started",
    "Timestamp when the game session ended",
    "Final score achieved in the session. Score = coins_collected + (3 × ufos_shot)",
    "Number of coins collected by the player",
    "Number of UFO enemies shot",
    "Total number of bullets fired",
    "Cause of death (collision type / hazard)",
    "Total session duration in seconds",
    "Number of pipes the player successfully passed",
    "Score normalized by session duration (score ÷ game_duration)",
    "Shooting accuracy (ufos_shot ÷ bullets_fired)"
  ),
  Type = c(
    "Character", "Datetime", "Datetime", "Integer", "Integer", "Integer",
    "Integer", "Categorical", "Numeric", "Integer", "Numeric", "Numeric"
  )
)

# Render the dictionary as a formatted table
variable_description %>%
  gt() %>%
  tab_header(title = md("**Variable Description - Plane Game Analytics**")) %>%
  cols_width(Variable ~ px(160), Description ~ px(420), Type ~ px(120)) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
Table 1
Variable Description - Plane Game Analytics
Variable | Description | Type
id | Unique session identifier | Character
start_time | Timestamp when the game session started | Datetime
end_time | Timestamp when the game session ended | Datetime
score | Final score achieved in the session. Score = coins_collected + (3 × ufos_shot) | Integer
coins_collected | Number of coins collected by the player | Integer
ufos_shot | Number of UFO enemies shot | Integer
bullets_fired | Total number of bullets fired | Integer
death_reason | Cause of death (collision type / hazard) | Categorical
game_duration | Total session duration in seconds | Numeric
pipes_passed | Number of pipes the player successfully passed | Integer
score_per_second | Score normalized by session duration (score ÷ game_duration) | Numeric
accuracy | Shooting accuracy (ufos_shot ÷ bullets_fired) | Numeric
2. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of visually and statistically examining the dataset to uncover patterns. This section examines the player data to understand core behaviors and outcomes. It begins with an overview of the distribution of key performance metrics, then investigates the most common reasons for game failure. Finally, it explores the relationships and correlations between different variables to understand how they influence one another.
Figure 1: Distribution of Game Performance Metrics
Figure 1 shows that the distributions for Score, Game Duration, Coins Collected, and UFOs Shot are all strongly right-skewed. This indicates that the vast majority of game sessions are short and result in low scores, which is a common characteristic of challenging, skill-based games. Most players fail early, while only a few achieve high scores and long playtimes. The Accuracy metric shows a more spread-out distribution but is still concentrated towards the lower values.
Figure 2 clearly shows that colliding with a “Pipe” is overwhelmingly the most common reason for a game to end. The second most frequent cause is hitting the “Ground”.
Figure 3: Correlation Matrix for Key Performance Metrics
Figure 3 presents a heatmap of the correlations between key performance metrics, with lighter shades of green indicating stronger positive relationships. As expected, score is highly correlated with game_duration, coins_collected, ufos_shot, and pipes_passed. Part of this follows directly from the scoring rule (Score = coins_collected + 3 × ufos_shot), and it also shows that players who survive longer and engage with game elements achieve higher scores. An equally important insight comes from the accuracy variable, whose cells are dark, indicating weak correlations with nearly all other metrics. This suggests that shooting accuracy is a largely independent skill that is not strongly tied to how long a player survives or how many points they accumulate through other means.
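A correlation matrix of this kind can be sketched in a few lines of base R. The data below is simulated purely for illustration (it is not the game dataset), and Pearson correlation is an assumption, since the report does not state which coefficient Figure 3 uses.

```r
# Toy stand-in for the session table; all values are simulated, not real sessions.
set.seed(42)
n <- 200
sessions <- data.frame(
  game_duration = rexp(n, rate = 1 / 10),
  bullets_fired = rpois(n, lambda = 8)
)
sessions$coins_collected <- rpois(n, lambda = sessions$game_duration * 0.5)
sessions$ufos_shot <- rbinom(n, size = sessions$bullets_fired, prob = 0.3)

# Scoring rule taken from Table 1
sessions$score <- sessions$coins_collected + 3 * sessions$ufos_shot

# Same zero-guard as the preprocessing step
sessions$accuracy <- ifelse(sessions$bullets_fired > 0,
                            sessions$ufos_shot / sessions$bullets_fired, 0)

# Pairwise Pearson correlations (assumed; Spearman would be method = "spearman")
cor_matrix <- cor(sessions, method = "pearson")
round(cor_matrix, 2)
```

Because score is built from coins_collected and ufos_shot, its correlation with those two columns is positive by construction, which is worth keeping in mind when reading the heatmap.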
2.2 Death Reason Deep-Dive
Code
# Survival Timeline by Death Reason
library(ggplot2)
library(plotly)

# Turn cut() labels such as "(5,10]" into "5-10"
format_bin <- function(x) {
  x <- gsub("\\(", "", x)
  x <- gsub("\\]", "", x)
  x <- gsub("\\[", "", x)
  x <- gsub("\\)", "", x)
  x <- gsub(",", "-", x)
  x
}

# Count deaths per 5-second duration bin and death reason
game_data_binned <- game_data_clean %>%
  mutate(duration_bin = cut(game_duration,
                            breaks = seq(0, 140, by = 5),
                            include.lowest = TRUE)) %>%
  filter(!is.na(duration_bin)) %>%
  mutate(duration_label = format_bin(as.character(duration_bin))) %>%
  count(duration_label, death_reason, name = "count")

# Ordered bin labels so the x-axis follows time rather than alphabetical order
duration_levels <- format_bin(as.character(levels(
  cut(seq(0, 100, by = 5),
      breaks = seq(0, 140, by = 5),
      include.lowest = TRUE)
)))

game_data_binned$duration_label <- factor(game_data_binned$duration_label,
                                          levels = duration_levels)

timeline_plot <- ggplot(
  game_data_binned,
  aes(
    x = duration_label,
    y = count,
    color = death_reason,
    group = death_reason,
    text = paste0(
      "<b>Death Reason:</b> ", death_reason, "<br>",
      "<b>Duration:</b> ", duration_label, " sec<br>",
      "<b>Count:</b> ", count
    )
  )
) +
  geom_line(size = 0.7) +
  geom_point(size = 1.5) +
  labs(
    title = "Survival Timeline by Death Reason",
    x = "Game Duration (seconds)",
    y = "Number of Deaths",
    color = "Death Reason"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = 5)),
    axis.text.y = element_text(margin = margin(r = 5))
  )

# Convert to an interactive plot with a centered title and bottom legend
ggplotly(timeline_plot, tooltip = "text") %>%
  layout(
    title = list(
      text = "<b>Survival Timeline by Death Reason</b>",
      x = 0.5, xanchor = "center",
      font = list(size = 17)
    ),
    legend = list(orientation = "h", x = 0.5, xanchor = "center",
                  y = -0.25, yanchor = "top"),
    xaxis = list(title_standoff = 20),
    yaxis = list(title_standoff = 20),
    margin = list(b = 160)
  )
Figure 4: Survival Timeline by Death Reason
Figure 4 provides a dynamic view of how different death reasons occur over time. The “Ground” and “Pipe” deaths are most frequent in the very early stages of the game (0-10 seconds), indicating these are the first major hurdles for new players. In contrast, deaths from “Enemy Bullet” become more prominent as the game duration increases, suggesting that enemies pose a greater threat to more experienced players who have mastered the basic pipe navigation.
Code
# Distribution of score by death_reason
stats <- game_data_clean %>%
  group_by(death_reason) %>%
  summarise(
    count = n(),
    mean = mean(score),
    min = min(score),
    q1 = quantile(score, 0.25),
    median = median(score),
    q3 = quantile(score, 0.75),
    max = max(score)
  )

# Attach the per-reason summary to every session so it can be shown on hover
df <- left_join(game_data_clean, stats, by = "death_reason")

p <- plot_ly()
unique_reasons <- unique(df$death_reason)

# One violin trace per death reason
for (dr in unique_reasons) {
  dsub <- df %>% filter(death_reason == dr)
  cd <- as.matrix(dsub[, c("count", "mean", "min", "q1", "median", "q3", "max")])
  p <- add_trace(
    p,
    data = dsub,
    x = ~death_reason,
    y = ~score,
    type = "violin",
    name = dr,
    box = list(visible = TRUE),
    meanline = list(visible = TRUE),
    customdata = cd,
    hovertemplate = paste(
      "<b>Death reason:</b> ", dr, "<br>",
      "<b>Score:</b> %{y}<br><br>",
      "<b>Count:</b> %{customdata[0]}<br>",
      "<b>Mean:</b> %{customdata[1]:.2f}<br>",
      "<b>Min:</b> %{customdata[2]}<br>",
      "<b>Q1:</b> %{customdata[3]}<br>",
      "<b>Median:</b> %{customdata[4]}<br>",
      "<b>Q3:</b> %{customdata[5]}<br>",
      "<b>Max:</b> %{customdata[6]}<extra></extra>"
    )
  )
}

p %>%
  layout(
    title = "Score Distribution by Death Reason",
    xaxis = list(title = "Death Reason"),
    yaxis = list(title = "Score")
  )
Figure 5: Score Distribution by Death Reason
Figure 5 provides a powerful comparison of player performance at the moment of failure, revealing a clear hierarchy of challenges. The distributions show that not all deaths are equal in terms of the skill level they represent:
Novice Failures: Dying by hitting the ground is associated with the lowest possible scores, with the distribution almost entirely concentrated at zero. This represents an immediate failure to grasp the basic flight mechanic. Similarly, ceiling collisions happen at very low scores.
Advanced Challenges: In stark contrast, deaths caused by enemy_bullet and ufo_collision are associated with significantly higher median scores. The box plots for these two categories are clearly elevated, indicating that only players who have already survived the initial obstacles and achieved a high score even encounter these threats. Dying to an enemy is a hallmark of a high-performing player pushing the limits of their skill.
The Universal Obstacle: The distribution for pipe collisions is unique. It has a low median score, confirming it’s a frequent cause of failure for less experienced players. However, its long upper tail, extending to the maximum score, shows that even the most expert players are not immune, making pipes the universal challenge that affects players at all skill levels.
Code
# Expected Value of Score Lost per Death Type
library(knitr)  # kable()

ev_loss <- game_data_clean %>%
  group_by(death_reason) %>%
  rename(`Death reason` = death_reason) %>%
  summarise(
    `Mean score` = mean(score),
    `Median score` = median(score),
    `Count of deaths` = n(),
    .groups = 'drop'
  ) %>%
  arrange(desc(`Mean score`))

ev_loss %>% kable()
Table 4: Expected Value of Score per Death Reason
Death reason | Mean score | Median score | Count of deaths
ufo_collision | 30.3333333 | 30.0 | 3
enemy_bullet | 28.2812500 | 22.5 | 32
pipe | 14.5054945 | 8.0 | 182
ceiling | 10.1250000 | 3.0 | 8
ground | 0.8133333 | 0.0 | 75
Table 4 provides a clear statistical summary of the skill level associated with each cause of failure. The data reveals a stark contrast between advanced threats and novice hurdles: deaths from ufo_collision (mean score: 30.3, though based on only 3 sessions) and enemy_bullet (mean score: 28.3, 32 sessions) happen to high-performing players, while failing by hitting the ground (mean score: 0.8, median: 0.0) is the definitive mark of a beginner. Positioned between these, pipe collisions represent the primary mid-game obstacle, being the most frequent cause of death (182 instances) with a moderate mean score of 14.5.
2.3 Behavioral Feature Engineering
To capture nuanced player strategies, I engineered eight behavioral features, including aggressiveness (bullets fired per second), efficiency (score per second), risk_taking (UFOs shot per pipe passed), and various rate-based metrics. These features transform raw action counts into meaningful behavioral patterns that more accurately represent player decision-making.
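The three features defined above can be sketched as follows. The helper safe_ratio and the toy rows are hypothetical, and returning 0 for a zero denominator is an assumption that mirrors the zero-guard used for score_per_second and accuracy during preprocessing. The remaining rate-based features are omitted because their exact definitions are not given here.

```r
# Hypothetical helper: ratio with a zero-guard (assumed to match the preprocessing style)
safe_ratio <- function(num, den) ifelse(den > 0, num / den, 0)

# Three invented sessions; score follows Score = coins_collected + 3 * ufos_shot
sessions <- data.frame(
  score           = c(0, 12, 45),
  coins_collected = c(0, 6, 15),
  ufos_shot       = c(0, 2, 10),
  bullets_fired   = c(0, 10, 25),
  pipes_passed    = c(0, 4, 20),
  game_duration   = c(0, 8, 30)
)

features <- transform(
  sessions,
  aggressiveness = safe_ratio(bullets_fired, game_duration),  # bullets per second
  efficiency     = safe_ratio(score, game_duration),          # score per second
  risk_taking    = safe_ratio(ufos_shot, pipes_passed)        # UFOs shot per pipe passed
)
features[, c("aggressiveness", "efficiency", "risk_taking")]
```

The first row shows why the guard matters: a session that ends instantly has zero duration and zero pipes, and without the guard every ratio would be NaN.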
Figure 6 provides a detailed look at how different player strategies relate to one another. The values and colors (deep blue for strong positive correlation, white for no correlation, red for negative correlation) reveal the core mechanics of successful play:
The “Winning” Strategy is High-Risk, High-Reward: The strongest correlations exist between efficiency, risk_taking, and ufo_rate (correlation values of 0.89 to 0.97). This is a critical insight: players who are the most efficient (highest score-per-second) are precisely those who take risks to engage UFOs. The game heavily rewards an active, combat-oriented playstyle over a passive, purely survival-focused one.
Aggressiveness is Independent of Accuracy: There is virtually no correlation (0.03) between aggressiveness (how often a player fires) and accuracy. This finding does not support the common assumption that firing more rapidly (“spraying”) lowers accuracy. It suggests that skilled players can maintain their accuracy even at a high rate of fire, while unskilled players are inaccurate regardless of how often they shoot. In this data, the two behave as largely independent skills.
Negligible Strategic Trade-offs: The only negative correlation on the chart is between coin_rate and aggressiveness (-0.08), which is extremely weak. This indicates there is no significant trade-off between focusing on shooting and focusing on collecting coins; skilled players appear to do both effectively.
Overall, this matrix clearly demonstrates that success in the game is not just about survival, but about efficient, risk-taking engagement with enemies.
2.4 Session Progression Analysis
This section analyzes the dynamics within a game session, focusing on how performance metrics evolve over time and in relation to each other. Instead of just looking at final outcomes, these plots examine the journey. The goal is to understand the relationship between survival time and score accumulation, as well as the efficiency of player actions like shooting.
Code
# (A) Score vs Duration with trendline
score_duration_plot <- ggplot(game_data_enhanced, aes(x = game_duration, y = score)) +
  geom_point(alpha = 0.6, color = "#1f77b4") +
  geom_smooth(method = "loess", color = "#ff7f0e", se = TRUE) +  # smooth trend with confidence band
  labs(
    title = "Score vs Game Duration with Trendline",
    x = "Game Duration (seconds)",
    y = "Score"
  ) +
  theme_minimal()

ggplotly(score_duration_plot)
Figure 7: Score vs Game Duration with Trendline
Code
# (B) Bullets vs UFO Shot
efficiency_plot <- ggplot(game_data_enhanced,
                          aes(x = bullets_fired, y = ufos_shot, color = skill_tier)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +  # one linear trendline per skill tier
  labs(
    title = "Bullets Fired vs UFOs Shot",
    x = "Bullets Fired", y = "UFOs Shot",
    color = "Skill Tier (by Score)"
  ) +
  theme_minimal()

ggplotly(efficiency_plot)
Figure 8: Bullets Fired vs UFOs Shot
Figure 7 reveals a strong, positive, and non-linear relationship between how long a player survives (game_duration) and their final score. The upward curve of the trendline indicates an accelerating return: the longer a player survives, the more rapidly their score increases per second. This suggests that skilled players who survive longer are not just accumulating points over more time, but are also becoming more effective at scoring as they encounter more opportunities (UFOs, coins). Furthermore, the widening “cone” shape of the data points and the broadening confidence interval show that while survival is a necessary condition for a high score, there is a much greater variance in scoring ability among long-lasting players.
Figure 8 provides a powerful visualization of shooting efficiency, segmented by Skill Tier. The key insight lies in the slope of the trendlines for each player group:
High-Skill Tiers (e.g., Pink: 50+ Score): The trendline for the most skilled players is extremely steep. This demonstrates a very high efficiency: a small increase in bullets fired results in a large increase in UFOs shot.
Low-Skill Tiers (e.g., Red: 0-4 Score): The trendlines for the least skilled players are nearly flat. They may fire a moderate number of bullets, but they achieve almost no successful hits. This visually separates raw activity (firing bullets) from effective outcomes (hitting targets). In essence, the chart indicates that simply being aggressive is not enough; success is defined by the efficiency of that aggression, a trait clearly demonstrated by the higher skill tiers.
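The slope comparison described above can also be read off numerically by fitting one linear model per tier, which is what a per-group linear trendline does. This is a minimal sketch on invented data: only the two tier labels mentioned in the text are used, and the bullet and hit counts are made up for illustration.

```r
# Toy sessions for two tiers; tier labels follow the figure, the values are invented.
toy <- data.frame(
  skill_tier    = rep(c("0-4", "50+"), each = 4),
  bullets_fired = c(2, 4, 6, 8, 10, 20, 30, 40),
  ufos_shot     = c(0, 0, 1, 0, 5, 10, 15, 20)
)

# Fit ufos_shot ~ bullets_fired separately within each tier and keep the slope,
# i.e. the extra UFOs shot per additional bullet fired.
tier_slopes <- sapply(split(toy, toy$skill_tier), function(d) {
  coef(lm(ufos_shot ~ bullets_fired, data = d))[["bullets_fired"]]
})
round(tier_slopes, 2)
```

A steeper slope means more hits per additional bullet. Since the tiers are defined by score and score includes 3 points per UFO shot, some of the separation between tiers is expected by construction.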
2.5 Clusterable Structure Check
Before attempting to segment the player base, it’s essential to determine if the data contains meaningful, inherent structures. This section uses Principal Component Analysis (PCA), a powerful dimensionality reduction technique, to achieve this. PCA condenses multiple behavioral and performance variables into two principal components (PC1 and PC2) that capture the majority of the data’s variance. By plotting these components, we can visually inspect the data for natural groupings, which validates the use of a clustering algorithm like K-Means in the next section.
Code
library(ggrepel)  # geom_text_repel()

# Standardize the behavioral and performance variables
cluster_data <- game_data_enhanced %>%
  select(score, game_duration, coins_collected, bullets_fired, ufos_shot,
         pipes_passed, aggressiveness, efficiency, accuracy, risk_taking) %>%
  scale()

# scale. = TRUE is redundant after scale() above, but harmless
pca_result <- prcomp(cluster_data, scale. = TRUE)

# PCA Loadings - Meaning of Principal components
pca_loadings <- as.data.frame(pca_result$rotation[, 1:2])
pca_loadings$feature <- rownames(pca_loadings)

loadings_plot <- ggplot(pca_loadings, aes(x = PC1, y = PC2, label = feature)) +
  geom_point(size = 3, color = "blue") +
  geom_text_repel(size = 4, max.overlaps = 10) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(
    title = "PCA Loadings - Meaning of Principal Components",
    # Axis labels report the share of variance explained by each component
    x = paste0("PC1 (", round(100 * pca_result$sdev[1]^2 / sum(pca_result$sdev^2), 1), "%)"),
    y = paste0("PC2 (", round(100 * pca_result$sdev[2]^2 / sum(pca_result$sdev^2), 1), "%)")
  ) +
  theme_minimal()

loadings_plot
Figure 9: PCA Loadings - Meaning of Principal Components
Figure 9 is essential for interpreting the PCA results, as it explains the meaning behind the two principal components that summarize 87.3% of the player behavior variance. Each point represents an original variable, and its position reveals its contribution to PC1 and PC2.
PC1 - “Progression & Skill” (60.8%): All the variables directly associated with successful outcomes (score, game_duration, pipes_passed, coins_collected, ufos_shot, and bullets_fired) have large, positive loadings on the x-axis. This means that a player’s position along the horizontal axis is a direct and powerful measure of their overall skill and progression within a single game session. Moving from left to right signifies a transition from a low-performing session to a high-performing one.
PC2 - “Playstyle: Passive Survival vs. Active Combat” (26.5%): This component contrasts two groups of variables.
The variables with positive loadings (pointing upwards) are primarily the raw accumulation metrics: game_duration, coins_collected, and pipes_passed. These represent a playstyle focused on longevity and steady progress. A high score on this axis indicates a “Passive Survival” approach, where the main goal is to dodge obstacles and last as long as possible.
The variables with negative loadings (pointing downwards) are the key behavioral ratios: aggressiveness, risk_taking, efficiency, and, critically, accuracy. These metrics measure the intensity and effectiveness of a player’s actions. A low score on this axis indicates an “Active Combat” or “High-Efficiency” playstyle, where the player is actively engaging with enemies, taking risks to shoot UFOs, and maximizing their score-per-second.
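The variance shares and loadings discussed above come straight out of prcomp(). The sketch below uses simulated data with two hidden drivers, named skill and style here purely for illustration, to show how the percentages and the loading table are obtained. None of the numbers correspond to the real sessions.

```r
# Simulated stand-in: two latent drivers generate five observed variables.
set.seed(1)
n <- 300
skill <- rnorm(n)  # hypothetical "progression" driver
style <- rnorm(n)  # hypothetical "playstyle" driver
toy <- data.frame(
  score          = 2.0 * skill + rnorm(n, sd = 0.3),
  game_duration  = 1.5 * skill + 0.8 * style + rnorm(n, sd = 0.3),
  pipes_passed   = 1.5 * skill + 0.8 * style + rnorm(n, sd = 0.3),
  aggressiveness = -1.0 * style + rnorm(n, sd = 0.3),
  accuracy       = -1.0 * style + rnorm(n, sd = 0.3)
)

# Center and standardize inside prcomp(), so no separate scale() call is needed
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained[1:2], 1)

# Loadings: how strongly each original variable contributes to PC1 and PC2
round(pca$rotation[, 1:2], 2)
```

One caveat when reading a loadings plot: the sign of each component is arbitrary, so "positive" and "negative" directions can flip between runs or software versions without changing the interpretation of the contrast.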
Figure 10: PCA - Player Behavior Patterns by Death Reason
Figure 10 visualizes the game sessions on a 2D map defined by player skill and playstyle. With the context from the PCA Loadings, the distribution of each death_reason tells a compelling story about the player journey:
The Novice Zone (Far Left):
ground (dark blue): Tightly clustered in the upper-left quadrant. This represents players with very low skill (low PC1) who adopt a Passive Survival playstyle (high PC2) but fail by not acting enough, letting the plane fall.
ceiling (light blue/teal): Tightly clustered in the lower-left quadrant. This represents players with very low skill (low PC1) who adopt an Active Combat or overly-aggressive playstyle (low PC2) and fail by acting too much, crashing into the ceiling. These two groups represent the two classic types of beginner mistakes: inaction vs. overreaction.
The Intermediate Challenge (Center):
pipe (pink): These points are spread throughout the center of the plot. This visually confirms that pipe collisions are the universal challenge that players must overcome to transition from novice to skilled. They affect players of all playstyles.
The Advanced Zone (Far Right):
ufo_collision (light green) and enemy_bullet (orange): These points dominate the right side of the plot, representing the most skilled and highest-progressing players. Crucially, they are almost entirely in the lower-right quadrant. This signifies that the most successful players are those who employ a highly effective Active Combat strategy. Dying by an enemy bullet or a ufo collision is, paradoxically, a sign of a high-skill player who has survived long enough to face the game’s most difficult threats.
3. Bootstrapping Data for Machine Learning
The original dataset, with 300 samples, is relatively small for training complex machine learning models. To create a larger training set, with the aim of reducing overfitting, this section employs bootstrapping: the original training data (the 250 earliest sessions) is resampled with replacement to generate a much larger, synthetic dataset (10,000 samples). The new dataset retains the statistical properties of the original data and gives the models more rows to learn from, although every row is a copy of one of the 250 original sessions, so it adds no new information. The 50 most recent sessions are kept separate as a “holdout” test set for unbiased model evaluation.
Code
# Sort by time and split data
game_sorted <- game_data_enhanced %>% arrange(start_time)
train_base <- head(game_sorted, 250)    # earliest 250 sessions
test_holdout <- tail(game_sorted, 50)   # most recent 50 sessions

# Bootstrap training data
set.seed(123)
bootstrap_size <- 10000
train_bootstrapped <- train_base %>%
  slice_sample(n = bootstrap_size, replace = TRUE) %>%
  mutate(is_synthetic = TRUE)

cat("Original Train Size:", nrow(train_base), "\n")
cat("Holdout Test Size:", nrow(test_holdout), "\n")
Holdout Test Size: 50
Code
# Compare distribution real vs bootstrapped
compare_plot <- ggplot() +
  geom_density(data = train_base, aes(x = score, color = "Real"), size = 1) +
  geom_density(data = train_bootstrapped, aes(x = score, color = "Bootstrapped"),
               size = 1, alpha = 0.7) +
  labs(
    title = "Score Distribution: Real vs Bootstrapped Data",
    x = "Score", y = "Density", color = "Data Type"
  ) +
  theme_minimal()

compare_plot
Figure 11: Score Distribution: Real vs Bootstrapped Data
This density plot is a critical validation step. It overlays the distribution of the score variable from the original training data (“Real”) with the distribution from the new synthetic data (“Bootstrapped”). The two curves align almost perfectly, which confirms that the bootstrapping process was successful. The synthetic data accurately mirrors the statistical characteristics of the original data, making it a reliable and larger dataset for model training.
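The same check can be made numerically. The sketch below resamples a small invented score vector (not the real training scores) with base R's sample(), compares quartiles, and counts how many values in the resample were absent from the original.

```r
set.seed(123)

# Invented, right-skewed toy scores standing in for the real training scores
real_scores <- c(0, 0, 0, 1, 2, 3, 5, 8, 8, 12, 20, 35)

# Resample with replacement, as in the bootstrapping step
boot_scores <- sample(real_scores, size = 10000, replace = TRUE)

# Quartiles of the resample track those of the original
rbind(
  real = quantile(real_scores, c(0.25, 0.5, 0.75)),
  boot = quantile(boot_scores, c(0.25, 0.5, 0.75))
)

# Resampling can only repeat existing values, so no new score ever appears
n_new_values <- length(setdiff(unique(boot_scores), unique(real_scores)))
n_new_values
```

Since the resample is drawn from the original values, close agreement between the two distributions is expected by construction. The check confirms the resampling was done correctly rather than showing that the larger dataset carries extra information.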
4. Segmentation (Unsupervised Learning)
This section uses unsupervised learning to discover natural groupings or “personas” among the players based on their in-game behavior, without any predefined labels. By applying the K-Means clustering algorithm to key behavioral and performance metrics, the goal is to segment the player base into a few distinct clusters. Analyzing the characteristics of each cluster can reveal different types of playstyles and skill levels.
Code
library(factoextra)  # fviz_cluster()

# Features for clustering
cluster_features_enhanced <- train_bootstrapped %>%
  select(score, game_duration, coins_collected, bullets_fired, ufos_shot,
         pipes_passed, aggressiveness, efficiency, accuracy, risk_taking)

# Standardize so that no single feature dominates the distance metric
scaled_features_enhanced <- scale(cluster_features_enhanced)

# KMeans clustering with 3 clusters
set.seed(123)
kmeans_enhanced <- kmeans(scaled_features_enhanced, centers = 3, nstart = 25)
train_bootstrapped$cluster_enhanced <- as.factor(kmeans_enhanced$cluster)

# Visualize clusters
fviz_cluster(kmeans_enhanced,
             data = scaled_features_enhanced,
             geom = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Enhanced Player Segmentation (K-Means)")
Figure 12: Enhanced Player Segmentation (K-Means)
Figure 12 visualizes the results of the K-Means clustering. Each point represents a game session, and its color corresponds to the cluster it was assigned to. The data is plotted on the first two principal components (Dim1 and Dim2), which capture the most variance in the data. The clear separation between the three colored groups (red, green, and blue) indicates that the algorithm successfully identified three distinct patterns of player behavior.
Table 5 provides a quantitative summary of the three identified clusters, showing the average value of key metrics for each group. This is where the player personas become clear:
Cluster 1 (956 samples): The “Experts”. This group has a very high average score (52.5), a long game duration (37s), and high numbers for coins, UFOs shot, and bullets fired. They are highly skilled and engaged players.
Cluster 2 (3061 samples): The “Novices”. This group has an extremely low average score (0.4) and a very short game duration (1.4s). These are likely new players who fail almost immediately and are at the highest risk of churning.
Cluster 3 (5983 samples): The “Average Players”. This group sits between the other two, with a moderate average score (10.8) and game duration (7.5s). They represent the bulk of the player base who have passed the initial learning curve but have not yet reached expert level.
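A per-cluster profile like the one summarized above can be produced by averaging each metric within a cluster. The following is a minimal base R sketch on simulated data: the three groups are generated to loosely resemble the reported averages, but the values are invented and only two metrics are included.

```r
set.seed(123)

# Simulated sessions from three invented groups (not the real data)
toy <- data.frame(
  score         = c(rnorm(50, 0.5, 0.3), rnorm(50, 11, 2), rnorm(50, 52, 5)),
  game_duration = c(rnorm(50, 1.5, 0.4), rnorm(50, 7.5, 1.5), rnorm(50, 37, 4))
)

# Cluster on standardized features, as in the segmentation step
km <- kmeans(scale(toy), centers = 3, nstart = 25)
toy$cluster <- factor(km$cluster)

# Mean of each metric per cluster, plus cluster sizes
profile <- aggregate(cbind(score, game_duration) ~ cluster, data = toy, FUN = mean)
profile$n <- as.vector(table(toy$cluster))
profile
```

Cluster numbers from kmeans() are arbitrary labels, so the persona names have to be assigned after inspecting the profile, not assumed from the cluster index.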
5. Predictive Modeling (Supervised Learning)
Leveraging the insights and the enhanced dataset, this section focuses on building predictive models using supervised learning. The objective is to tackle three distinct business problems: predicting a player’s final score (Score Regression), predicting whether a player will survive longer than a 30-second threshold (Survival Prediction), and predicting the specific cause of death (Multiclass Classification). Advanced algorithms like Random Forest and XGBoost are trained on the bootstrapped data and evaluated on the holdout test set.
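To make the survival task concrete, the sketch below shows one way to derive a binary 30-second label and fit a logistic regression to it. Everything here is an assumption for illustration: the data is simulated, the two predictors are chosen arbitrarily, and the label name survived_30s is hypothetical.

```r
set.seed(7)
n <- 500

# Simulated behavioral features (invented ranges)
aggressiveness <- runif(n, 0, 2)
accuracy <- runif(n, 0, 1)

# Simulated ground truth: survival odds rise with accuracy (toy assumption)
p_survive <- plogis(-2 + 3 * accuracy + 0.2 * aggressiveness)
game_duration <- ifelse(runif(n) < p_survive, runif(n, 30.1, 120), runif(n, 0, 30))

toy <- data.frame(aggressiveness, accuracy, game_duration)

# Binary target: did the session last longer than 30 seconds?
toy$survived_30s <- as.integer(toy$game_duration > 30)

# game_duration itself is left out of the predictors, because the label is
# a direct threshold of it and including it would leak the answer.
fit <- glm(survived_30s ~ aggressiveness + accuracy, data = toy, family = binomial())

pred_prob <- predict(fit, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
mean(pred_class == toy$survived_30s)  # in-sample accuracy on the toy data
```

The same leakage concern applies to any feature computed from the full session length, so a feature list for this task is worth reviewing against how each column is derived.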
5.1 Score Regression
Code
library(randomForest)
library(caret)  # RMSE() and R2() are assumed to come from caret

# Define features
features_enhanced <- c("coins_collected", "ufos_shot", "bullets_fired", "game_duration",
                       "pipes_passed", "aggressiveness", "efficiency", "accuracy",
                       "risk_taking")

# Random Forest with enhanced features
rf_enhanced <- randomForest(
  as.formula(paste("score ~", paste(features_enhanced, collapse = "+"))),
  data = train_bootstrapped,
  ntree = 100,
  importance = TRUE
)

# Predict and assess on the untouched holdout set
predictions_rf_enhanced <- predict(rf_enhanced, newdata = test_holdout)
rmse_enhanced <- RMSE(predictions_rf_enhanced, test_holdout$score)
r2_enhanced <- R2(predictions_rf_enhanced, test_holdout$score)

cat("Enhanced Random Forest Performance:\n")
Enhanced Random Forest Performance:
Code
cat("RMSE:", round(rmse_enhanced, 2), "\n")
RMSE: 0.84
Code
cat("R-Squared:", round(r2_enhanced, 4), "\n")
R-Squared: 0.9961
Code
# Variable importance
varImpPlot(rf_enhanced, main = "Enhanced Feature Importance for Score Prediction")
Figure 13: Enhanced Feature Importance for Score Prediction
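When reading the R-squared and the importance ranking, it helps to recall that Table 1 defines Score = coins_collected + (3 × ufos_shot), and both components are among the model's features. The sketch below, on simulated counts (not the real sessions), shows that even a plain linear model recovers this rule almost exactly.

```r
set.seed(99)
n <- 250

# Simulated action counts (invented rates)
coins_collected <- rpois(n, lambda = 5)
ufos_shot <- rpois(n, lambda = 3)

# Scoring rule from the variable description table
score <- coins_collected + 3 * ufos_shot

# A linear model on the two components reproduces the rule
fit <- lm(score ~ coins_collected + ufos_shot)
round(coef(fit), 6)  # intercept near 0, slopes near 1 and 3
```

Under this scoring rule, a very high R-squared says more about the score formula than about the model's ability to anticipate player behavior. A stricter test would drop coins_collected and ufos_shot from the feature list, which is a suggestion and not something the report does.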
ceiling enemy_bullet ground pipe ufo_collision
0.333 0.250 0.667 0.840 NaN
6. Business Insights & Recommendations
Based on the analysis above, we derive the following actionable insights:
6.1. Difficulty Balancing:
Observation: Pipe collisions account for 182 of the 300 recorded deaths (Table 4), and Figure 4 shows that pipe and ground deaths are concentrated in the first 10 seconds. This suggests the initial difficulty curve may be too steep.
Recommendation: Widen the pipe gap or lower the pipe spawn rate during the first 10 seconds of gameplay to improve retention.
6.2. Player Segmentation Strategy:
Observation: K-Means clustering identified three distinct groups (Table 5): Experts, Average Players, and Novices. In addition, the PCA playstyle axis (PC2, Figure 9) contrasts passive survival with active combat.
Recommendation: Introduce targeted rewards.
For ‘Survivors’ (High duration, low action): Introduce time-based achievements.
For ‘Shooters’ (High bullets/UFOs): Offer weapon skins or visual upgrades for combat milestones.
6.3. Predictive Engagement:
Observation: The Random Forest model shows that specific actions (like coins_collected or ufos_shot) are strong predictors of high scores.
Recommendation: Create a tutorial or “Daily Mission” focusing on these high-value actions to teach new players how to achieve higher scores effectively.
6.4. Monetization Opportunities:
Observation: Players who survive past the 30-second threshold (analyzed in the Logistic Regression) show higher engagement.
Recommendation: Trigger “Continue?” ads or special offers only after a player has demonstrated this “expert” survival trait, as they are more invested in the session than a player who dies instantly.